An $O(n^3)$-Time Algorithm for Tree Edit Distance

Abstract. The edit distance between two ordered trees with vertex labels is
the minimum cost of transforming one tree into the other by a sequence of
elementary operations consisting of deleting and relabeling existing nodes, as
well as inserting new nodes. In this paper, we present a worst-case
$O(n^3)$-time algorithm for this problem, improving the previous best
$O(n^3 \log n)$-time algorithm [6]. Our result requires a novel adaptive
strategy for deciding how a dynamic program divides into subproblems (which is
interesting in its own right), together with a deeper understanding of the
previous algorithms for the problem. We also prove the optimality of our
algorithm among the family of decomposition strategy algorithms, which also
includes the previous fastest algorithms, by tightening the known lower bound
of $\Omega(n^2 \log^2 n)$ [4] to $\Omega(n^3)$, matching our algorithm's running
time. Furthermore, we obtain matching upper and lower bounds of
$\Theta(nm^2(1 + \log\frac{n}{m}))$ when the two trees have different sizes $m$
and $n$, where $m < n$.



1 Introduction 

The problem of comparing trees occurs in diverse areas such as structured text
databases like XML, computer vision, compiler optimization, natural language
processing, and computational biology [1,2,7,9,10].

As an example, we describe an applica- 
tion in computational biology. Ribonucleic acid 
(RNA) is a polymer consisting of a sequence 
of nucleotides (Adenine, Cytosine, Guanine, and 
Uracil) connected linearly via a backbone. In ad- 
dition, complementary nucleotides (A-U, G-C, 
and G-U) can form hydrogen bonds, leading to 
a structural formation called the secondary structure of the RNA. Because of
the nested nature of these hydrogen bonds, the secondary structure of RNA can
be represented by a rooted ordered tree, as shown in Fig. 1. Recently,
comparing RNA sequences has gained increasing interest thanks to numerous
discoveries of biological functions associated with RNA. A major fraction of
RNA's function is determined by its secondary structure [8]. Therefore,
computing the similarity between the secondary structure of two RNA molecules
can help determine the functional similarities of these molecules.

Fig. 1. Two different ways of viewing an RNA sequence. In (a), a schematic
2-dimensional description of an RNA folding. In (b), the RNA as a rooted
ordered tree.



The tree edit distance metric is a common similarity measure for ordered
trees, introduced by Tai in the late 1970's [10] as a generalization of the
well-known string edit distance problem [12]. Let $F$ and $G$ be two rooted
trees with a left-to-right order among siblings and where each vertex is
assigned a label from an alphabet $\Sigma$. The edit distance between $F$ and
$G$ is the minimum cost of transforming $F$ into $G$ by a sequence of
elementary operations consisting of deleting and relabeling existing nodes, as
well as inserting new nodes (allowing at most one operation to be performed on
each node). These operations are illustrated in Fig. 2. The cost of elementary
operations is given by two functions, $c_{\mathrm{del}}$ and
$c_{\mathrm{match}}$, where $c_{\mathrm{del}}(\tau)$ is the cost of deleting or
inserting a vertex with label $\tau$, and $c_{\mathrm{match}}(\tau_1, \tau_2)$
is the cost of changing the label of a vertex from $\tau_1$ to $\tau_2$. A
deletion in $F$ is equivalent to an insertion in $G$ and vice versa, so we can
focus on finding the minimum cost of a sequence of deletions and relabels in
both trees that transform $F$ and $G$ into isomorphic trees.



[Figure 2: three panels labeled "Relabel node x to y", "Delete node x", and
"Insert node x".]

Fig. 2. The three editing operations on a tree with vertex labels.



Previous results. To state running times, we need some basic notation. Let $n$
and $m$ denote the sizes $|F|$ and $|G|$ of the two input trees, ordered so
that $n \ge m$. Let $n_{\mathrm{leaves}}$ and $m_{\mathrm{leaves}}$ denote the
corresponding number of leaves in each tree, and let $n_{\mathrm{height}}$ and
$m_{\mathrm{height}}$ denote the corresponding height of each tree, which can
be as large as $n$ and $m$ respectively.

Tai [10] presented the first algorithm for computing tree edit distance, which
requires $O(n_{\mathrm{leaves}}^2 m_{\mathrm{leaves}}^2 nm)$ time and space.
Tai's algorithm thus has a worst-case running time of $O(n^3 m^3) = O(n^6)$.
Shasha and Zhang [9] improved this result to an
$O(\min\{n_{\mathrm{height}}, n_{\mathrm{leaves}}\} \cdot \min\{m_{\mathrm{height}}, m_{\mathrm{leaves}}\} \cdot nm)$
time algorithm using $O(nm)$ space. In the worst case, their algorithm runs in
$O(n^2 m^2) = O(n^4)$ time.



Klein [6] improved this result to a worst-case
$O(m^2 n \log n) = O(n^3 \log n)$-time algorithm that uses $O(nm)$ space. In
addition, Klein's algorithm can be adapted to solve an unrooted version of the
problem. These last two algorithms are based on closely related dynamic
programs, and both present different ways of computing only a subset of a
larger dynamic program table; these entries are referred to as relevant
subproblems. In [4], Dulucq and Touzet introduced the notion of a
decomposition strategy (see Section 2.3) as a general framework for algorithms
that use this type of dynamic program, and proved a lower bound of
$\Omega(nm \log n \log m)$ time for any such strategy.

Many other solutions have been developed; see [1, 11] for surveys. The most
recent development is by Chen [3], who presented a different approach that
uses results on fast matrix multiplication. Chen's algorithm uses
$O(nm + n m_{\mathrm{leaves}}^2 + n_{\mathrm{leaves}} m_{\mathrm{leaves}}^{2.5})$
time and
$O(n + (m + n_{\mathrm{leaves}}^2) \min\{n_{\mathrm{leaves}}, n_{\mathrm{height}}\})$
space. In the worst case, this algorithm runs in $O(nm^{2.5}) = O(n^{3.5})$
time. In general, Klein's algorithm remained the best in terms of worst-case
time complexity.



Our results. In this paper, we present a new algorithm for tree edit distance
that falls into the same decomposition strategy framework of [6, 9, 4]. Our
algorithm runs in $O(nm^2(1 + \log\frac{n}{m})) = O(n^3)$ worst-case time and
$O(nm)$ space, and can be adapted for the case where the trees are not rooted.
The corresponding edit script can easily be obtained within the same time and
space bounds. We therefore improve upon all known algorithms in the worst-case
time complexity. Our approach is based on Klein's, but whereas the recursion
scheme in Klein's algorithm is determined by just one of the two input trees,
in our algorithm the recursion depends alternately on both trees. Furthermore,
we prove a worst-case lower bound of $\Omega(nm^2(1 + \log\frac{n}{m}))$ time
on all decomposition strategy algorithms. This bound improves the previous
best lower bound of $\Omega(nm \log n \log m)$ time [4], and establishes the
optimality of our algorithm among all decomposition strategy algorithms. Our
algorithm is simple, making it easy to implement, but both the upper and lower
bound proofs require complicated analysis.



Roadmap. In Section 2 we give simple and unified presentations of the two
well-known tree edit algorithms, on which our algorithm is based, and the
class of decomposition strategy algorithms. We present and analyze our
algorithm in Section 3, and prove the matching lower bound in Section 4. We
conclude in Section 5.

2 Background and Framework 

Both the existing algorithms and ours compute the edit distance of finite
ordered $\Sigma$-labeled forests, henceforth forests. The unique empty
forest/tree is denoted by $\emptyset$. The vertex set of a forest $F$ is
written simply as $F$, as when we speak of a vertex $v \in F$. For any forest
$F$ and $v \in F$, $\sigma(v)$ denotes the $\Sigma$-label of $v$, $F_v$ denotes
the subtree of $F$ rooted at $v$, and $F - v$ denotes the forest obtained from
$F$ after deleting $v$. The leftmost and rightmost trees of $F$ are denoted by
$L_F$ and $R_F$ and their roots by $\ell_F$ and $r_F$. We denote by $F - L_F$
the forest obtained from $F$ after deleting the entire leftmost tree $L_F$;
similarly $F - R_F$. A forest obtained from $F$ by a sequence of any number of
deletions of the leftmost and rightmost roots is called a subforest of $F$.

Given forests $F$ and $G$ and vertices $v \in F$ and $w \in G$, we write
$c_{\mathrm{del}}(v)$ instead of $c_{\mathrm{del}}(\sigma(v))$ for the cost of
deleting or inserting $v$, and we write $c_{\mathrm{match}}(v, w)$ instead of
$c_{\mathrm{match}}(\sigma(v), \sigma(w))$ for the cost of relabeling $v$ to
$w$. $\delta(F, G)$ denotes the edit distance between the forests $F$ and $G$.

Because insertion and deletion costs are the 
same (for a node of a given label), insertion 
in one forest is tantamount to deletion in the 
other forest. Therefore, the only edit operations 
we need to consider are relabels and deletions of 
nodes in both forests. In the next two sections, 
we briefly present the algorithms of Shasha and 
Zhang, and of Klein. Our presentation is inspired 
by the tree similarity survey of Bille [1], and is 
essential for understanding our algorithm. 

2.1 Shasha and Zhang's Algorithm [9] 

Given two forests F and G of sizes n and m re- 
spectively, the following lemma is easy to verify. 



Intuitively, this lemma says that the two right- 
most roots in F and G are either matched with 
each other or one of them is deleted. 

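Lemma 1 ([9]). $\delta(F, G)$ can be computed as follows:

• $\delta(\emptyset, \emptyset) = 0$

• $\delta(F, \emptyset) = \delta(F - r_F, \emptyset) + c_{\mathrm{del}}(r_F)$

• $\delta(\emptyset, G) = \delta(\emptyset, G - r_G) + c_{\mathrm{del}}(r_G)$

• $\delta(F, G) = \min \begin{cases}
\delta(F - r_F, G) + c_{\mathrm{del}}(r_F), \\
\delta(F, G - r_G) + c_{\mathrm{del}}(r_G), \\
\delta(R_F - r_F, R_G - r_G) + \delta(F - R_F, G - R_G) + c_{\mathrm{match}}(r_F, r_G)
\end{cases}$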



The above lemma yields an $O(m^2 n^2)$ dynamic program algorithm: If we index
the vertices of the forests $F$ and $G$ according to their postorder traversal
position, then entries in the dynamic program table correspond to pairs
$(F', G')$ of subforests $F'$ of $F$ and $G'$ of $G$ where $F'$ contains
vertices $\{i_1, \ldots, j_1\}$ and $G'$ contains vertices
$\{i_2, \ldots, j_2\}$ for some $1 \le i_1 \le j_1 \le n$ and
$1 \le i_2 \le j_2 \le m$.

However, as we will presently see, only
$O(\min\{n_{\mathrm{height}}, n_{\mathrm{leaves}}\} \cdot \min\{m_{\mathrm{height}}, m_{\mathrm{leaves}}\} \cdot nm)$
different relevant subproblems are encountered by the recursion computing
$\delta(F, G)$. We calculate the number of relevant subforests of $F$ and $G$
independently, where a forest $F'$ (respectively $G'$) is a relevant subforest
of $F$ (respectively $G$) if it shows up in the computation of $\delta(F, G)$.
Clearly, multiplying the number of relevant subforests of $F$ and of $G$ is an
upper bound on the total number of relevant subproblems.
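
As a concrete reading of Lemma 1, the recursion can be sketched as a memoized Python function. The nested-tuple encoding of trees and the unit costs in `c_del` and `c_match` are assumptions made purely for illustration; the sketch memoizes on whole forests and makes no attempt to organize the relevant subproblems into an efficient table.

```python
from functools import lru_cache

# Assumed encoding: a tree is (label, children), children is a tuple of trees,
# and a forest is a tuple of trees ordered left to right.

def c_del(label):
    # Hypothetical unit cost for deleting or inserting a node.
    return 1

def c_match(a, b):
    # Hypothetical relabel cost: free when the labels agree, 1 otherwise.
    return 0 if a == b else 1

@lru_cache(maxsize=None)
def delta(F, G):
    """Edit distance between forests F and G via the recursion of Lemma 1."""
    if not F and not G:
        return 0
    if not G:
        label, kids = F[-1]
        return delta(F[:-1] + kids, G) + c_del(label)      # F - r_F
    if not F:
        label, kids = G[-1]
        return delta(F, G[:-1] + kids) + c_del(label)      # G - r_G
    (lf, kf), (lg, kg) = F[-1], G[-1]
    return min(
        delta(F[:-1] + kf, G) + c_del(lf),                 # delete r_F
        delta(F, G[:-1] + kg) + c_del(lg),                 # delete r_G
        # match r_F with r_G: (R_F - r_F, R_G - r_G) and (F - R_F, G - R_G)
        delta(kf, kg) + delta(F[:-1], G[:-1]) + c_match(lf, lg),
    )

t1 = ("a", (("b", ()), ("c", ())))
t2 = ("a", (("b", ()),))
print(delta((t1,), (t2,)))  # 1: deleting the node labeled "c" suffices
```

Deleting the rightmost root splices its children into its place, which is why `F[:-1] + kids` represents $F - r_F$.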

We focus on counting the number of relevant subforests of $F$. The count for
$G$ is similar. First, notice that for every node $v \in F$, $F_v - v$ is a
relevant subforest. This is because the recursion allows us to delete the
rightmost root of $F$ repeatedly until $v$ becomes the rightmost root; we then
match $v$ (i.e., relabel it) and get the desired relevant subforest. A more
general claim is stated and proved later on in Lemma 3. We define
$\mathrm{keyroots}(F) = \{\text{the root of } F\} \cup \{v \in F \mid v \text{ has a left sibling}\}$.
Every relevant subforest of $F$ is a prefix (with respect to the postorder
indices) of $F_v - v$ for some node $v \in \mathrm{keyroots}(F)$. If we define
$\mathrm{cdepth}(v)$ to be the number of keyroot ancestors of $v$, and
$\mathrm{cdepth}(F)$ to be the maximum $\mathrm{cdepth}(v)$ over all nodes
$v \in F$, we get that the total number of relevant subforests of $F$ is at
most

$$\sum_{v \in \mathrm{keyroots}(F)} |F_v| = \sum_{v \in F} \mathrm{cdepth}(v) \le \sum_{v \in F} \mathrm{cdepth}(F) = |F| \, \mathrm{cdepth}(F).$$

This means that given two trees, $F$ and $G$, of sizes $n$ and $m$ we can
compute $\delta(F, G)$ in $O(\mathrm{cdepth}(F) \, \mathrm{cdepth}(G) \, mn)$
time. Shasha and Zhang also proved that for any tree $T$ of size $n$,
$\mathrm{cdepth}(T) \le \min\{n_{\mathrm{height}}, n_{\mathrm{leaves}}\}$, hence
the result. In the worst case, this algorithm runs in $O(m^2 n^2) = O(n^4)$
time.
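
The counting identity for keyroots can be checked on a small tree. The sketch below assumes a tree stored as a list of child lists with root 0, and it counts a keyroot as its own ancestor when computing cdepth, which is the reading under which the equality holds exactly. The function names are illustrative only.

```python
def subtree_sizes(children, root=0):
    """Size of the subtree rooted at each node; children[v] lists v's children left to right."""
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])
    size = [1] * len(children)
    for v in reversed(order):            # every child is finished before its parent
        for c in children[v]:
            size[v] += size[c]
    return size

def keyroot_counts(children, root=0):
    """Return (sum of |F_v| over keyroots, list of cdepth values)."""
    size = subtree_sizes(children, root)
    keyroots = {root}
    for kids in children:
        keyroots.update(kids[1:])        # every non-leftmost child has a left sibling
    cdepth = [0] * len(children)
    stack = [(root, 0)]
    while stack:
        v, above = stack.pop()
        cdepth[v] = above + (v in keyroots)
        stack.extend((c, cdepth[v]) for c in children[v])
    return sum(size[k] for k in keyroots), cdepth

# Root 0 has children 1 and 2; node 1 has children 3 and 4.
children = [[1, 2], [3, 4], [], [], []]
total, cdepth = keyroot_counts(children)
print(total, cdepth)  # 7 [1, 1, 2, 1, 2]
```

Here the keyroots are 0, 2 and 4, so both sides of the identity equal 7, and the bound $|F| \, \mathrm{cdepth}(F) = 10$ is not tight.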



2.2 Klein's Algorithm [6] 

Klein's algorithm is based on a recursion similar to Lemma 1. Again, we
consider forests $F$ and $G$ of sizes $|F| = n \ge |G| = m$. Now, however,
instead of recursing always on the rightmost roots of $F$ and $G$, we recurse
on the leftmost roots if $|L_F| \le |R_F|$ and on the rightmost roots
otherwise. In other words, the "direction" of the recursion is determined by
the (initially) larger of the two forests. We assume the number of relevant
subforests of $G$ is $O(m^2)$; we have already established that this is an
upper bound.

We next show that Klein's algorithm yields only $O(n \log n)$ relevant
subforests of $F$. The analysis is based on a technique called heavy path
decomposition introduced by Harel and Tarjan [5]. Briefly: we mark the root of
$F$ as light. For each internal node $v \in F$, we pick one of $v$'s children
of maximum size and mark it as heavy, and we mark all the other children of
$v$ as light. We define $\mathrm{ldepth}(v)$ to be the number of light nodes
that are ancestors of $v$ in $F$, and $\mathrm{light}(F)$ as the set of all
light nodes in $F$. By [5], for any forest $F$ and vertex $v \in F$,
$\mathrm{ldepth}(v) \le \log|F| + O(1)$.

Note that every relevant subforest of $F$ is obtained by some $i \le |F_v|$
many consecutive deletions from $F_v$ for some light node $v$. Therefore, the
total number of relevant subforests of $F$ is at most

$$\sum_{v \in \mathrm{light}(F)} |F_v| = \sum_{v \in F} \mathrm{ldepth}(v) \le \sum_{v \in F} (\log|F| + O(1)) = O(|F| \log|F|).$$

Thus, we get an $O(m^2 n \log n) = O(n^3 \log n)$ algorithm for computing
$\delta(F, G)$.
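
The heavy path marking and the resulting count can be sketched in a few lines. The list-of-child-lists encoding, the random test tree, and the choice to count a light node as its own ancestor (so that the equality above is exact) are assumptions of this sketch, not part of the paper.

```python
import math
import random

def subtree_sizes(children, root=0):
    """Size of the subtree rooted at each node."""
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])
    size = [1] * len(children)
    for v in reversed(order):            # children before parents
        for c in children[v]:
            size[v] += size[c]
    return size

def light_depths(children, root=0):
    """Mark light nodes and compute ldepth, counting v itself when v is light."""
    size = subtree_sizes(children, root)
    light = {root}
    for kids in children:
        if kids:
            heavy = max(kids, key=size.__getitem__)   # one child of maximum size
            light.update(c for c in kids if c != heavy)
    ldepth = [0] * len(children)
    stack = [(root, 0)]
    while stack:
        v, above = stack.pop()
        ldepth[v] = above + (v in light)
        stack.extend((c, ldepth[v]) for c in children[v])
    return size, light, ldepth

def random_tree(n, seed=0):
    """Hypothetical test input: node v > 0 attaches to a random earlier node."""
    rng = random.Random(seed)
    children = [[] for _ in range(n)]
    for v in range(1, n):
        children[rng.randrange(v)].append(v)
    return children

children = random_tree(200)
size, light, ldepth = light_depths(children)
print(sum(size[v] for v in light), sum(ldepth))  # the two sums coincide
```

Each light child has at most half the size of its parent's subtree, which is what bounds the number of light nodes on any root-to-node path by $1 + \log_2 |F|$.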

2.3 The Decomposition Strategy Framework

Both Klein's and Shasha and Zhang's algorithms are based on Lemma 1. The
difference between them lies in the choice of when to recurse on the rightmost
roots and when on the leftmost roots. The family of decomposition strategy
algorithms based on this lemma was formalized by Dulucq and Touzet in [4].

Definition 1 (Strategy). Let $F$ and $G$ be two forests. A strategy is a
mapping from pairs $(F', G')$ of subforests of $F$ and $G$ to
$\{\mathrm{left}, \mathrm{right}\}$.

Each strategy is associated with a specific set of recursive calls (or a
dynamic program algorithm). The strategy of Shasha and Zhang's algorithm is
$S(F', G') = \mathrm{right}$ for all $F', G'$. The strategy of Klein's
algorithm is $S(F', G') = \mathrm{left}$ if $|L_{F'}| \le |R_{F'}|$, and
$S(F', G') = \mathrm{right}$ otherwise. Notice that Shasha and Zhang's strategy
does not depend on the input trees, while Klein's strategy depends only on the
larger input tree. Dulucq and Touzet proved a lower bound of
$\Omega(mn \log m \log n)$ time for any strategy based algorithm.
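
To make the framework concrete, the following sketch enumerates the relevant subproblems induced by a strategy. It rests on several assumptions of ours: trees are lists of child lists with root 0, a subforest is identified by the tuple of its root ids (left to right), pairs with an empty forest are not counted, and a strategy is passed as a plain function. The final check uses the property stated in Lemma 3 of Section 4, namely that every pair $(F_v - v, G_w - w)$ appears regardless of the strategy.

```python
import random

def subtree_sizes(children, root=0):
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])
    size = [1] * len(children)
    for v in reversed(order):
        for c in children[v]:
            size[v] += size[c]
    return size

def peel(forest, children, side):
    """For the side-most root r: (forest - r, children of r, forest - tree of r)."""
    if side == "left":
        kids, rest = tuple(children[forest[0]]), forest[1:]
        return kids + rest, kids, rest
    kids, rest = tuple(children[forest[-1]]), forest[:-1]
    return rest + kids, kids, rest

def relevant_subproblems(cf, cg, strategy):
    """All pairs of non-empty subforests reached by the recursion of Lemma 1."""
    size_f = subtree_sizes(cf)
    seen, stack = set(), [((0,), (0,))]
    while stack:
        f, g = stack.pop()
        if not f or not g or (f, g) in seen:
            continue
        seen.add((f, g))
        side = strategy(f, g, size_f)
        f_del, f_kids, f_rest = peel(f, cf, side)
        g_del, g_kids, g_rest = peel(g, cg, side)
        # delete in F, delete in G, and the two halves of a match
        stack += [(f_del, g), (f, g_del), (f_kids, g_kids), (f_rest, g_rest)]
    return seen

def shasha_zhang(f, g, size_f):
    return "right"

def klein(f, g, size_f):
    return "left" if size_f[f[0]] <= size_f[f[-1]] else "right"

def random_tree(n, seed):
    rng = random.Random(seed)
    children = [[] for _ in range(n)]
    for v in range(1, n):
        children[rng.randrange(v)].append(v)
    return children

cf, cg = random_tree(30, 1), random_tree(12, 2)
for strategy in (shasha_zhang, klein):
    print(strategy.__name__, len(relevant_subproblems(cf, cg, strategy)))
```

Identifying a subforest by its roots works because deletions only ever remove roots, so every remaining root still carries its complete subtree.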

3 The Algorithm 

In this section we present our algorithm for computing $\delta(F, G)$ given
two trees $F$ and $G$ of sizes $|F| = n \ge |G| = m$. The algorithm recursively
uses Klein's strategy in a divide-and-conquer manner to achieve
$O(nm^2(1 + \log\frac{n}{m})) = O(n^3)$ running time in the worst case. The
algorithm's space complexity is $O(nm)$. We begin with the observation that
Klein's strategy always determines the direction of the recursion according to
the $F$-subforest, even in subproblems where the $F$-subforest is smaller than
the $G$-subforest. However, it is not straightforward to change this since
even if at some stage we decide to switch to Klein's strategy based on the
other forest, we must still make sure that all subproblems previously
encountered are entirely solved. At first glance this seems like a real
obstacle since apparently we only add new subproblems to those that are
already computed.

For clarity we describe the algorithm recursively. A dynamic programming
description and a proof of the $O(mn)$ space complexity will appear in the
full version of this paper.

For a tree $F$ of size $n$, define the set $\mathrm{TopLight}_F$ to be the set
of roots of the forest obtained by removing the heavy path of $F$ (i.e., the
unique path starting from the root along heavy nodes). Note that
$\mathrm{TopLight}_F$ is the set of light nodes with ldepth 1 in $F$ (see the
definition of ldepth in Section 2.2). This definition is illustrated in Fig. 3.

Fig. 3. A tree $F$ with $n$ nodes. The black nodes belong to the heavy path.
The white nodes are in $\mathrm{TopLight}_F$, and the size of each subtree
rooted at a white node is at most $\frac{n}{2}$.

Note that the following two conditions are always satisfied:

(*) $\sum_{v \in \mathrm{TopLight}_F} |F_v| \le n$. This follows from the fact
that $F_{v'}$ and $F_{v''}$ are disjoint for any
$v', v'' \in \mathrm{TopLight}_F$.

(**) $|F_v| \le \frac{n}{2}$ for every $v \in \mathrm{TopLight}_F$, since
otherwise $v$ would be a heavy node.
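
Conditions (*) and (**) are easy to check numerically. The sketch below assumes a tree given as a list of child lists with root 0 and uses a random tree as a hypothetical input; `top_light` walks the heavy path from the root and collects the light children hanging off it.

```python
import random

def subtree_sizes(children, root=0):
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])
    size = [1] * len(children)
    for v in reversed(order):            # children before parents
        for c in children[v]:
            size[v] += size[c]
    return size

def top_light(children, size, root=0):
    """Roots of the forest that remains after removing the heavy path from `root`."""
    top, v = [], root
    while children[v]:
        heavy = max(children[v], key=size.__getitem__)
        top.extend(c for c in children[v] if c != heavy)
        v = heavy
    return top

def random_tree(n, seed=0):
    rng = random.Random(seed)
    children = [[] for _ in range(n)]
    for v in range(1, n):
        children[rng.randrange(v)].append(v)
    return children

children = random_tree(300, seed=3)
n = len(children)
size = subtree_sizes(children)
top = top_light(children, size)
print(sum(size[v] for v in top) <= n)        # condition (*)
print(all(2 * size[v] <= n for v in top))    # condition (**)
```

In fact the sum in (*) equals $n$ minus the length of the heavy path, since the subtrees rooted at $\mathrm{TopLight}_F$ together with the heavy path partition $F$.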

The Algorithm. We compute $\delta(F, G)$ recursively as follows:

(1) If $|F| < |G|$, compute $\delta(G, F)$ instead. That is, we order the pair
$\{F, G\}$ such that $F$ is always the larger forest.

(2) Recursively compute $\delta(F_v, G)$ for all $v \in \mathrm{TopLight}_F$.
Note that along the way this computes $\delta(F_{v'} - v', G_w - w)$ for all
$v'$ not in the heavy path of $F$ and for all $w \in G$.

(3) Compute $\delta(F, G)$ using Klein's strategy (matching and deleting
either from the left or from the right according to the larger of $F$ and
$G$). Do not recurse into subproblems that were previously computed in
step (2).

The correctness of the algorithm follows immediately from the correctness of
Klein's algorithm. The algorithm is evidently a decomposition strategy
algorithm, since for all subproblems, it either deletes or matches the
leftmost or rightmost roots.

Time Complexity. We show that our algorithm has a worst-case runtime of
$O(m^2 n(1 + \log\frac{n}{m})) = O(n^3)$. We proceed by counting the number of
subproblems computed in each step of the algorithm. Let $R(F, G)$ denote the
number of relevant subproblems encountered by the algorithm in the course of
computing $\delta(F, G)$.

In step (2) we compute $\delta(F_v, G)$ for all $v \in \mathrm{TopLight}_F$.
Hence, the number of subproblems encountered in this step is
$\sum_{v \in \mathrm{TopLight}_F} R(F_v, G)$.

In step (3) we compute $\delta(F, G)$ using Klein's strategy. We bound the
number of relevant subproblems by multiplying the number of relevant
subforests in $F$ and in $G$. For $G$, we count all possible $O(|G|^2)$
subforests obtained by left and right deletions. Note that for any node $v'$
not in the heavy path of $F$, the subproblem obtained by matching $v'$ with
any node $w$ in $G$ was already computed in step (2). This is because any such
$v'$ is contained in $F_v$ for some $v \in \mathrm{TopLight}_F$, so
$\delta(F_{v'} - v', G_w - w)$ is computed in the course of computing
$\delta(F_v, G)$ (we prove this formally in Lemma 3). Furthermore, note that in
Klein's algorithm, a node $v$ on the heavy path of $F$ cannot be matched or
deleted until the remaining subforest of $F$ is precisely $F_v$. At this
point, both matching $v$ or deleting $v$ results in the same new relevant
subforest $F_v - v$. This means that we do not have to consider matchings of
nodes when counting the number of relevant subproblems in step (3). It
suffices to consider only the $|F|$ subforests obtained by deletions according
to Klein's strategy. Thus, the total number of new subproblems encountered in
step (3) is bounded by $|G|^2|F|$.

We have established that $R(F, G)$ is at most

$$\begin{cases}
|G|^2|F| + \sum_{v \in \mathrm{TopLight}_F} R(F_v, G) & \text{if } |F| \ge |G| \\
|F|^2|G| + \sum_{w \in \mathrm{TopLight}_G} R(F, G_w) & \text{if } |F| < |G|
\end{cases}$$

We first show, by a crude estimate, that this leads to an $O(n^3)$ runtime.
Later, we analyze the dependency on $m$ and $n$ accurately.

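Lemma 2. $R(F, G) \le 4(|F||G|)^{3/2}$.

Proof. We proceed by induction on $|F| + |G|$. There are two symmetric cases.
If $|F| \ge |G|$ then
$R(F, G) \le |G|^2|F| + \sum_{v \in \mathrm{TopLight}_F} R(F_v, G)$. Hence, by
the inductive assumption,

$$\begin{aligned}
R(F, G) &\le |G|^2|F| + \sum_{v \in \mathrm{TopLight}_F} 4(|F_v||G|)^{3/2} \\
&\le |G|^2|F| + 4|G|^{3/2} \sum_{v \in \mathrm{TopLight}_F} |F_v|^{3/2} \\
&\le |G|^2|F| + 4|G|^{3/2} \sum_{v \in \mathrm{TopLight}_F} |F_v| \max_{v \in \mathrm{TopLight}_F} \sqrt{|F_v|} \\
&\le |G|^2|F| + 4|G|^{3/2} |F| \sqrt{|F|/2} \\
&= |G|^2|F| + \sqrt{8}\,(|F||G|)^{3/2} \\
&\le 4(|F||G|)^{3/2}.
\end{aligned}$$

Here we have used facts (*) and (**) and the fact that $|F| \ge |G|$. The case
where $|F| < |G|$ is symmetric. □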
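
Lemma 2 can be sanity-checked by evaluating the recurrence directly. The sketch below treats the recurrence for $R(F, G)$ as an equality (which can only overestimate), represents the pair $(F_v, G_w)$ by the node ids $(v, w)$, and runs on random trees; the encoding and all helper names are assumptions of this sketch.

```python
from functools import lru_cache
import random

def subtree_sizes(children, root=0):
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])
    size = [1] * len(children)
    for v in reversed(order):
        for c in children[v]:
            size[v] += size[c]
    return size

def top_light(children, size, root):
    """Light children hanging off the heavy path that starts at `root`."""
    top, v = [], root
    while children[v]:
        heavy = max(children[v], key=size.__getitem__)
        top.extend(c for c in children[v] if c != heavy)
        v = heavy
    return top

def recurrence_value(cf, cg):
    """Right-hand side of the recurrence for R(F, G), unrolled to the leaves."""
    sf, sg = subtree_sizes(cf), subtree_sizes(cg)

    @lru_cache(maxsize=None)
    def bound(v, w):                     # stands for the pair (F_v, G_w)
        if sf[v] >= sg[w]:
            return sg[w] ** 2 * sf[v] + sum(bound(x, w) for x in top_light(cf, sf, v))
        return sf[v] ** 2 * sg[w] + sum(bound(v, y) for y in top_light(cg, sg, w))

    return bound(0, 0)

def random_tree(n, seed):
    rng = random.Random(seed)
    children = [[] for _ in range(n)]
    for v in range(1, n):
        children[rng.randrange(v)].append(v)
    return children

cf, cg = random_tree(200, 1), random_tree(50, 2)
print(recurrence_value(cf, cg) <= 4 * (200 * 50) ** 1.5)
```

The recursion always shrinks the larger of the two trees to at most half its size, so the nesting depth of `bound` stays logarithmic.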

This crude estimate gives a worst-case runtime of $O(n^3)$. We now analyze the
dependence on $m$ and $n$ more accurately. Along the recursion defining the
algorithm, we view step (2) as only making recursive calls, but not producing
any relevant subproblems. Rather, every new relevant subproblem is created in
step (3) for a unique recursive call of the algorithm. So when we count
relevant subproblems, we sum the number of new relevant subproblems
encountered in step (3) over all recursive calls to the algorithm.

We define sets $A, B \subseteq F$ as follows:

$$A = \{a \in \mathrm{light}(F) : |F_a| \ge m\}$$

$$B = \{b \in F - A : b \in \mathrm{TopLight}_{F_a} \text{ for some } a \in A\}.$$

Note that the root of $F$ belongs to $A$. We count separately:

(i) the relevant subproblems created in just step (3) of recursive calls
$\delta(F_a, G)$ for all $a \in A$, and

(ii) the relevant subproblems encountered in the entire computation of
$\delta(F_b, G)$ for all $b \in B$ (i.e., $\sum_{b \in B} R(F_b, G)$).

Together, this counts all relevant subproblems for the original
$\delta(F, G)$. To see this, consider the original call $\delta(F, G)$.
Certainly, the root of $F$ is in $A$. So all subproblems generated in step (3)
of $\delta(F, G)$ are counted in (i). Now consider the recursive calls made in
step (2) of $\delta(F, G)$. These are precisely $\delta(F_v, G)$ for
$v \in \mathrm{TopLight}_F$. For each $v \in \mathrm{TopLight}_F$, notice that
$v$ is either in $A$ or in $B$; it is in $A$ if $|F_v| \ge m$, and in $B$
otherwise. If $v$ is in $B$, then all subproblems arising in the entire
computation of $\delta(F_v, G)$ are counted in (ii). On the other hand, if $v$
is in $A$, then we are in an analogous situation with respect to
$\delta(F_v, G)$ as we were in when we considered $\delta(F, G)$ (i.e., we
count separately the subproblems created in step (3) of $\delta(F_v, G)$ and
the subproblems coming from $\delta(F_u, G)$ for
$u \in \mathrm{TopLight}_{F_v}$).

Earlier in this section, we saw that the number of subproblems created in
step (3) of $\delta(F, G)$ is $|G|^2|F|$. In fact, for any $a \in A$, by the
same argument, the number of subproblems created in step (3) of
$\delta(F_a, G)$ is $|G|^2|F_a|$. Therefore, the total number of relevant
subproblems of type (i) is $|G|^2 \sum_{a \in A} |F_a|$. For $v \in F$, define
$\mathrm{depth}_A(v)$ to be the number of ancestors of $v$ that lie in the set
$A$. We claim that $\mathrm{depth}_A(v) \le 1 + \log\frac{n}{m}$ for all
$v \in F$. To see this, consider any sequence $a_0, \ldots, a_k$ in $A$ where
$a_i$ is a descendant of $a_{i-1}$ for all $i \in [1, k]$. Note that
$|F_{a_i}| \le \frac{1}{2}|F_{a_{i-1}}|$ for all $i \in [1, k]$ since the $a_i$
are light nodes, and note that $|F_{a_k}| \ge m$ by the definition of $A$.
Since also $|F_{a_0}| \le n$, it follows that $k \le \log\frac{n}{m}$, i.e.,
$A$ contains no sequence of descendants of length $> 1 + \log\frac{n}{m}$. So
clearly every $v \in F$ has $\mathrm{depth}_A(v) \le 1 + \log\frac{n}{m}$.

We now have the number of relevant subproblems of type (i) as

$$|G|^2 \sum_{a \in A} |F_a| = m^2 \sum_{v \in F} \mathrm{depth}_A(v) \le m^2 \sum_{v \in F} \left(1 + \log\frac{n}{m}\right) = m^2 n \left(1 + \log\frac{n}{m}\right).$$

The relevant subproblems of type (ii) are counted by
$\sum_{b \in B} R(F_b, G)$. Using Lemma 2, we have

$$\sum_{b \in B} R(F_b, G) \le 4|G|^{3/2} \sum_{b \in B} |F_b|^{3/2} \le 4|G|^{3/2} \sum_{b \in B} |F_b| \max_{b \in B} \sqrt{|F_b|} \le 4|G|^{3/2}|F|\sqrt{m} = 4m^2 n.$$

Here we have used the facts that $|F_b| < m$ and
$\sum_{b \in B} |F_b| \le |F|$ (since the trees $F_b$ are disjoint for
different $b \in B$). Therefore, the total number of relevant subproblems for
$\delta(F, G)$, and hence the runtime of the algorithm, is at most
$m^2 n(1 + \log\frac{n}{m}) + 4m^2 n = O(m^2 n(1 + \log\frac{n}{m}))$.
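
The sets $A$ and $B$ and the facts used about them can be checked on an example. The sketch below assumes a list-of-child-lists tree with root 0, takes $m$ as a plain integer parameter standing for $|G|$, uses base-2 logarithms, and counts a node of $A$ as its own ancestor so that the type (i) identity is exact. All names are illustrative.

```python
import math
import random

def subtree_sizes(children, root=0):
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])
    size = [1] * len(children)
    for v in reversed(order):
        for c in children[v]:
            size[v] += size[c]
    return size

def split_light_nodes(children, m, root=0):
    """Return (size, A, B, depth_A) for a tree F and a threshold m = |G|."""
    size = subtree_sizes(children, root)
    light = {root}
    for kids in children:
        if kids:
            heavy = max(kids, key=size.__getitem__)
            light.update(c for c in kids if c != heavy)
    A = {a for a in light if size[a] >= m}
    B = set()
    for a in A:                          # TopLight of F_a: walk the heavy path from a
        v = a
        while children[v]:
            heavy = max(children[v], key=size.__getitem__)
            B.update(c for c in children[v] if c != heavy and c not in A)
            v = heavy
    depth_A = [0] * len(children)
    stack = [(root, 0)]
    while stack:
        v, above = stack.pop()
        depth_A[v] = above + (v in A)
        stack.extend((c, depth_A[v]) for c in children[v])
    return size, A, B, depth_A

def random_tree(n, seed):
    rng = random.Random(seed)
    children = [[] for _ in range(n)]
    for v in range(1, n):
        children[rng.randrange(v)].append(v)
    return children

n, m = 500, 20
size, A, B, depth_A = split_light_nodes(random_tree(n, 4), m)
print(sum(size[a] for a in A) == sum(depth_A))           # type (i) identity
print(max(depth_A) <= 1 + math.log2(n / m))              # depth_A bound
print(all(size[b] < m for b in B), sum(size[b] for b in B) <= n)
```

The last line checks the two facts about $B$ that the type (ii) bound relies on.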

Unrooted Trees. Our algorithm can be adapted 
to compute edit distance of unrooted ordered 
trees. An unrooted ordered tree is an acyclic 
graph with a cyclic ordering defined on the edges 
incident on each node in the graph. In the mod- 
ified algorithm, we arbitrarily choose a root for 
the larger of the two trees. We change the first recursive level of the
algorithm, so that it now computes the edit distance with respect to any
possible choice of a root for the smaller tree. This does not change the time
complexity since the number of different relevant subforests for a tree of
size $m$ is bounded by $m^2$ whether we consider a single choice for the root
or all possible choices. This idea will be described in detail in the full
version of this paper.

4 A Tight Lower Bound for Strategy 
Algorithms 

In this section we present a lower bound on the worst-case runtime of strategy
algorithms. We first give a simple proof of an $\Omega(m^2 n)$ lower bound. In
the case where $m = \Theta(n)$, this gives a lower bound of $\Omega(n^3)$ which
shows that our algorithm is worst-case optimal among all strategy-based
algorithms. To prove that our algorithm is worst-case optimal for any
$m \le n$, we analyze a more complicated scenario that gives a lower bound of
$\Omega(m^2 n(1 + \log\frac{n}{m}))$, matching the running time of our
algorithm.

In analyzing strategies we will use the notion of a computational path, which
corresponds to a specific sequence of recursion calls. Recall that for all
subforest-pairs $(F', G')$, the strategy $S$ determines a direction: either
right or left. The recursion can either delete from $F'$ or from $G'$ or
match. A computational path is the sequence of operations taken according to
the strategy in a specific sequence of recursive calls. For convenience, we
sometimes describe a computational path by the sequence of subproblems it
induces, and sometimes by the actual sequence of operations: either "delete
from the $F$-subforest", "delete from the $G$-subforest", or "match".

The following lemma states that every strat- 
egy computes the edit distance between every 
two root-deleted subtrees of F and G. 

Lemma 3. For any strategy $S$, the pair $(F_v - v, G_w - w)$ is a relevant
subproblem for all $v \in F$ and $w \in G$.

Proof. First note that a node $v' \in F_v$ (respectively, $w' \in G_w$) is
never deleted or matched before $v$ (respectively, $w$) is deleted or matched.
Consider the following computational path:

- Delete from $F$ until $v$ is either the leftmost or the rightmost root.

- Next, delete from $G$ until $w$ is either the leftmost or the rightmost
  root.

Let $(F', G')$ denote the resulting subproblem. There are four cases to
consider.

1. $v$ and $w$ are the rightmost (leftmost) roots of $F'$ and $G'$, and
   $S(F', G') = \mathrm{right}$ (left).
   Match $v$ and $w$ to get the desired subproblem.

2. $v$ and $w$ are the rightmost (leftmost) roots of $F'$ and $G'$, and
   $S(F', G') = \mathrm{left}$ (right).
   Note that at least one of $F', G'$ is not a tree (since otherwise this is
   case (1)). Delete from one which is not a tree. After a finite number of
   such deletions we have reduced to case (1), either because $S$ changes
   direction, or because both forests become trees whose roots are $v, w$.

3. $v$ is the rightmost root of $F'$, $w$ is the leftmost root of $G'$.
   If $S(F', G') = \mathrm{left}$, delete from $F'$; otherwise delete from
   $G'$. After a finite number of such deletions this reduces to one of the
   previous cases when one of the forests becomes a tree.

4. $v$ is the leftmost root of $F'$, $w$ is the rightmost root of $G'$.
   This case is symmetric to (3).

We now turn to the $\Omega(m^2 n)$ lower bound on the number of relevant
subproblems for any strategy.

Lemma 4. For any strategy $S$, there exists a pair of trees $(F, G)$ with
sizes $n, m$ respectively, such that the number of relevant subproblems is
$\Omega(m^2 n)$.

Proof. Let $S$ be an arbitrary strategy, and consider the trees $F$ and $G$
depicted in Fig. 4. According to Lemma 3, every pair $(F_v - v, G_w - w)$
where $v \in F$ and $w \in G$ is a relevant subproblem for $S$. Focus on such a
subproblem where $v$ and $w$ are internal nodes of $F$ and $G$. Denote $v$'s
right child by $v_r$ and $w$'s left child by $w_\ell$. Note that $F_v - v$ is a
forest whose rightmost root is the node $v_r$. Similarly, $G_w - w$ is a
forest whose leftmost root is $w_\ell$. Starting from $(F_v - v, G_w - w)$,
consider the computational path $c_{v,w}$ that deletes from $F$ whenever the
strategy says left and deletes from $G$ otherwise. In both cases, neither
$v_r$ nor $w_\ell$ is deleted. Such deletions can be carried out so long as
both forests are non-empty.

Fig. 4. The two trees used to prove an $\Omega(m^2 n)$ lower bound.

The length of this computational path is at least
$\min\{|F_v|, |G_w|\} - 1$. Note that for each subproblem $(F', G')$ along this
computational path, $v_r$ is the rightmost root of $F'$ and $w_\ell$ is the
leftmost root of $G'$. It follows that for every two distinct pairs
$(v_1, w_1) \ne (v_2, w_2)$ of internal nodes in $F$ and $G$, the relevant
subproblems occurring along the computational paths $c_{v_1,w_1}$ and
$c_{v_2,w_2}$ are disjoint. Since there are $\frac{n}{2}$ and $\frac{m}{2}$
internal nodes in $F$ and $G$ respectively, the total number of subproblems
along the $c_{v,w}$ computational paths is given by:

$$\sum_{(v,w) \text{ internal nodes}} \left(\min\{|F_v|, |G_w|\} - 1\right) = \sum_{i=1}^{n/2} \sum_{j=1}^{m/2} \min\{2i, 2j\} = \Omega(m^2 n).$$

The $\Omega(m^2 n)$ lower bound established by Lemma 4 is tight if
$m = \Theta(n)$, since in this case our algorithm achieves an $O(n^3)$ runtime.
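
The growth of the double sum in the proof of Lemma 4 can be checked numerically. The sketch below evaluates the sum by brute force for even $n \ge m$ and compares it with $nm^2/12$; that constant comes from a short side calculation of ours (for $b \le a$, $\sum_{i \le a} \sum_{j \le b} \min\{i, j\} \ge ab^2/3$), not from the paper.

```python
def disjoint_path_total(n, m):
    """Sum of min{2i, 2j} over i in [1, n/2] and j in [1, m/2]; n, m assumed even."""
    return sum(min(2 * i, 2 * j)
               for i in range(1, n // 2 + 1)
               for j in range(1, m // 2 + 1))

for n, m in [(8, 8), (64, 8), (100, 40), (200, 2)]:
    total = disjoint_path_total(n, m)
    # Ratio against m^2 * n stays bounded away from zero, as Omega(m^2 n) requires.
    print(n, m, total, total / (m * m * n))
```

A ratio that stays above $1/12$ across very different shapes of $(n, m)$ is consistent with the $\Omega(m^2 n)$ claim, though of course it does not replace the proof.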

To establish a tight bound when $m$ is not $\Theta(n)$, we use the following
technique for counting relevant subproblems. We associate a subproblem
consisting of subforests $(F', G')$ with the unique pair of vertices $(v, w)$
such that $F_v, G_w$ are the smallest trees containing $F', G'$ respectively.
For example, for nodes $v$ and $w$ with at least two children, the subproblem
$(F_v - v, G_w - w)$ is associated with the pair $(v, w)$. Note that all
subproblems encountered in a computational path starting from
$(F_v - v, G_w - w)$ until the point where either forest becomes a tree are
also associated with $(v, w)$.





Fig. 5. The two trees used to prove the $\Omega(m^2 n \log\frac{n}{m})$ lower
bound.



Lemma 5. For every strategy $S$, there exists a pair of trees $(F, G)$ with
sizes $n \ge m$ such that the number of relevant subproblems is
$\Omega(m^2 n \log\frac{n}{m})$.


Proof. Consider the trees illustrated in Fig. 5. The $n$-sized tree $F$ is a
complete balanced binary tree, and $G$ is a "zigzag" tree of size $m$. Let $w$
be an internal node of $G$ with a single node $w_r$ as its right subtree and
$w_\ell$ as a left child. Denote $m' = |G_w|$. Let $v$ be a node in $F$ such
that $F_v$ is a tree of size $n' + 1$ such that $n' \ge 4m \ge 4m'$. Denote
$v$'s left and right children $v_\ell$ and $v_r$ respectively. Note that
$|F_{v_\ell}| = |F_{v_r}| = \frac{n'}{2}$.

Let $S$ be an arbitrary strategy. We aim to show that the total number of
relevant subproblems associated with $(v, w)$ or with $(v, w_\ell)$ is at
least $\frac{n'}{4}(m' - 2)$. Let $c$ be the computational path that always
deletes from $F$ (no matter whether $S$ says left or right). We consider two
complementary cases.



Case 1: $\frac{n'}{4}$ left deletions occur in the computational path $c$, and
at the time of the $\frac{n'}{4}$th left deletion, there were fewer than
$\frac{n'}{4}$ right deletions.

We define a set of new computational paths $\{c_j\}_{1 \le j \le n'/4}$ where
$c_j$ deletes from $F$ up through the $j$th left deletion, and thereafter
deletes from $F$ whenever $S$ says right and from $G$ whenever $S$ says left.
At the time the $j$th left deletion occurs, at least
$\frac{n'}{4} \ge m' - 2$ nodes remain in $F_{v_r}$ and all $m' - 2$ nodes are
present in $G_{w_\ell}$. So on the next $m' - 2$ steps along $c_j$, neither of
the subtrees $F_{v_r}$ and $G_{w_\ell}$ is totally deleted. Thus, we get
$m' - 2$ distinct relevant subproblems associated with $(v, w)$. Notice that in
each of these subproblems, the subtree $F_{v_\ell}$ is missing exactly $j$
nodes. So we see that, for different values of $j \in [1, \frac{n'}{4}]$, we
get disjoint sets of $m' - 2$ relevant subproblems. Summing over all $j$, we
get $\frac{n'}{4}(m' - 2)$ distinct relevant subproblems associated with
$(v, w)$.



Case 2: $\frac{n'}{4}$ right deletions occur in the computational path $c$,
and at the time of the $\frac{n'}{4}$th right deletion, there were fewer than
$\frac{n'}{4}$ left deletions.

We define a different set of computational paths
$\{\gamma_j\}_{1 \le j \le n'/4}$ where $\gamma_j$ deletes from $F$ up through
the $j$th right deletion, and thereafter deletes from $F$ whenever $S$ says
left and from $G$ whenever $S$ says right (i.e., $\gamma_j$ is $c_j$ with the
roles of left and right exchanged). Similarly as in case 1, for each
$j \in [1, \frac{n'}{4}]$ we get $m' - 2$ distinct relevant subproblems in
which $F_{v_r}$ is missing exactly $j$ nodes. All together, this gives
$\frac{n'}{4}(m' - 2)$ distinct subproblems. Note that since we never make left
deletions from $G$, the left child of $w_\ell$ is present in all of these
subproblems. Hence, each subproblem is associated with either $(v, w)$ or
$(v, w_\ell)$.

In either case, we get $\frac{n'}{4}(m' - 2)$ distinct relevant subproblems
associated with $(v, w)$ or $(v, w_\ell)$. To get a lower bound on the number
of problems we sum over all pairs $(v, w)$ with $G_w$ being a tree whose right
subtree is a single node, and $|F_v| \ge 4m$. There are $\frac{m}{4}$ choices
for $w$ corresponding to tree sizes $4j$ for $j \in [1, \frac{m}{4}]$. For $v$,
we consider all nodes of $F$ whose distance from a leaf is at least
$\log(4m)$. For each such pair we count the subproblems associated with
$(v, w)$ and $(v, w_\ell)$. So the total number of relevant subproblems counted
in this way is